INTERSPEECH 2012 - Language and Multimodal

Total: 125

#1 Arabic dialect identification - "is the secret in the silence?" and other observations

Authors: Hynek Bořil ; Abhijeet Sangwan ; John H. L. Hansen

Conversational telephone speech (CTS) collections of Arabic dialects distributed through the Linguistic Data Consortium (LDC) provide an invaluable resource for the development of robust speech systems including speaker and speech recognition, translation, spoken dialogue modeling, and information summarization. They are also frequently relied on in language identification (LID) and dialect identification (DID) evaluations. The first part of this study attempts to identify the source of the relatively high DID performance on LDC's Arabic CTS corpora seen in recent literature. It is found that recordings of each dialect exhibit unique channel and noise characteristics and that silence regions are sufficient for performing reasonably accurate DID. The second part focuses on phonotactic dialect modeling that utilizes phone recognizers and support vector machines (PRSVM). A new N-gram normalization of PRSVM input supervectors is introduced and shown to outperform the standard approach used in current LID and DID systems.

#2 The 2011 NIST language recognition evaluation

Authors: Craig S. Greenberg ; Alvin F. Martin ; Mark A. Przybocki

In 2011, NIST held the most recent in an ongoing series of Language Recognition Evaluations originating in 1996. The 2011 NIST Language Recognition Evaluation (LRE11) featured 24 languages, including nine languages new to the LRE series, from two different source types, and had participation from 23 research organizations. LRE11 utilized a new evaluation metric focused on difficult-to-distinguish language pairs. The most difficult pairs were generally contained within clusters of linguistically similar languages: for example, the Hindi/Urdu pair and the Lao/Thai pair both proved very challenging to distinguish. Pashto and Bengali were found to be confusable with a wide range of languages, and some progress was observed in distinguishing American English from Indian English.
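
The pair-wise emphasis of the LRE11 metric can be illustrated with a toy computation. The sketch below is not NIST's official scoring; it assumes a simplified symmetric cost per language pair and invented error rates, purely to show how hard pairs dominate an average over pairs.

```python
def pair_cost(p_miss: float, p_fa: float) -> float:
    """Simplified symmetric detection cost for one language pair.

    The official LRE11 metric is more involved; this is illustrative.
    """
    return 0.5 * (p_miss + p_fa)

# Hypothetical per-pair error rates; hard pairs dominate the average.
pairs = {("Hindi", "Urdu"): (0.30, 0.28),
         ("Lao", "Thai"): (0.25, 0.27),
         ("English", "Mandarin"): (0.02, 0.01)}
costs = {p: pair_cost(*e) for p, e in pairs.items()}
print(max(costs, key=costs.get))          # the hardest pair
print(sum(costs.values()) / len(costs))   # average pair cost
```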

#3 The BLZ submission to the NIST 2011 LRE: data collection, system development and performance

Authors: Luis Javier Rodríguez-Fuentes ; Mikel Penagarikano ; Amparo Varona ; Mireia Diez ; Germán Bordel ; Alberto Abad ; David Martínez ; Jesus Villalba ; Alfonso Ortega ; Eduardo Lleida

This paper describes the most relevant features of a collaborative multi-site submission to the NIST 2011 Language Recognition Evaluation (LRE), consisting of one primary and three contrastive systems, each fusing different combinations of 13 state-of-the-art (acoustic and phonotactic) language recognition subsystems. The collaboration focused on collecting and sharing training data for those target languages for which NIST provided little development data, and on defining a common development dataset to train backend and fusion parameters and to select the best fusions. Official and post-key results are presented and compared, revealing that the greedy approach applied to select the best fusions provided suboptimal but very competitive performance. Several factors contributed to the high performance attained by BLZ systems, including the availability of training data for low-resource target languages, the reliability of the development dataset (consisting only of data audited by NIST), the diversity of modeling approaches, features and datasets in the systems considered for fusion, and the effectiveness of the search for optimal fusions.

#4 Phonotactic language recognition using i-vectors and phoneme posteriogram counts

Authors: Luis Fernando D'Haro ; Ondřej Glembek ; Oldřich Plchot ; Pavel Matějka ; Mehdi Soufifar ; Ricardo Cordoba ; Jan Černocký

This paper describes a novel approach to phonotactic LID where, instead of using soft counts based on phoneme lattices, we use phoneme posteriograms to obtain n-gram counts. The high-dimensional vectors of counts are reduced to low-dimensional units, for which we adapted the commonly used term i-vectors. The reduction is based on multinomial subspace modeling and is designed to work in the total-variability space. The proposed technique was tested on the NIST 2009 LRE set, giving better results than a system based on soft counts (Cavg on 30s: 3.15% vs. 3.43%), and very good results when fused with an acoustic i-vector LID system (Cavg on 30s: 2.4% for the acoustic system alone vs. 1.25% for the fusion). The proposed technique is also compared with another low-dimensional projection system based on PCA. In comparison with the original soft counts, the proposed technique provides better results, reduces problems due to sparse counts, and avoids the need for pruning when creating the lattices.
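
The counting step described above lends itself to a short illustration. The following sketch (not the authors' code) accumulates expected phoneme-bigram counts from a frame-level posteriogram under a naive frame-independence assumption; the array shapes and the toy posteriogram are invented for the example.

```python
import numpy as np

def expected_bigram_counts(posteriogram: np.ndarray) -> np.ndarray:
    """Accumulate expected phoneme-bigram counts from a posteriogram.

    posteriogram: (T, P) array whose row t holds P(phoneme | frame t).
    Returns a (P, P) matrix of soft bigram counts. Adjacent frames are
    treated as independent, a simplification of the paper's approach.
    """
    num_phones = posteriogram.shape[1]
    counts = np.zeros((num_phones, num_phones))
    for prev, cur in zip(posteriogram[:-1], posteriogram[1:]):
        counts += np.outer(prev, cur)  # soft co-occurrence of each phone pair
    return counts

# Toy usage: 4 frames over a 3-phoneme inventory.
post = np.array([[0.8, 0.1, 0.1],
                 [0.6, 0.3, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.1, 0.7, 0.2]])
print(expected_bigram_counts(post))
```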

#5 Supervector LDA: a new approach to reduced-complexity i-vector language recognition

Authors: Alan McCree ; Bengt Borgström

In this paper, we extend our previous analysis of Gaussian Mixture Model (GMM) subspace compensation techniques using Gaussian modeling in the supervector space combined with additive channel and observation noise. We show that under the modeling assumptions of a total-variability i-vector system, full Gaussian supervector scoring can also be performed cheaply in the total subspace, and that i-vector scoring can be viewed as an approximation to this. Next, we show that covariance matrix estimation in the i-vector space can be used to generate PCA estimates of supervector covariance matrices needed for Joint Factor Analysis (JFA). Finally, we derive a new technique for reduced-dimension i-vector extraction which we call Supervector LDA (SV-LDA), and demonstrate a 100-dimensional i-vector language recognition system with equivalent performance to a 600-dimensional version at much lower complexity.
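
For readers unfamiliar with LDA in this role, the sketch below shows a generic class-aware projection of i-vectors with scikit-learn. It is only a stand-in: the paper's SV-LDA derives the projection in the supervector space, and all data and dimensions here are invented.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 600))        # hypothetical 600-dim "i-vectors"
y = rng.integers(0, 5, size=500)       # hypothetical labels for 5 languages

# LDA projects onto at most (n_classes - 1) dimensions; SV-LDA instead
# derives a reduced-dimension extractor in the supervector space.
lda = LinearDiscriminantAnalysis(n_components=4)
X_low = lda.fit_transform(X, y)
print(X_low.shape)                     # (500, 4)
```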

#6 Patrol team language identification system for DARPA RATS P1 evaluation

Authors: Pavel Matějka ; Oldřich Plchot ; Mehdi Soufifar ; Ondřej Glembek ; Luis Fernando D'Haro ; Karel Veselý ; František Grézl ; Jeff Ma ; Spyros Matsoukas ; Najim Dehak

This paper describes the language identification (LID) system developed by the Patrol team for the first phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to advance state-of-the-art detection capabilities on audio from highly degraded communication channels. We show that techniques originally developed for LID on telephone speech (e.g., for the NIST language recognition evaluations) remain effective on the noisy RATS data, provided that careful consideration is applied when designing the training and development sets. In addition, we show significant improvements from the use of Wiener filtering, neural-network-based i-vectors, language-dependent i-vector modeling, and fusion.

#7 Morpheme level feature-based language models for German LVCSR

Authors: Amr El-Desoky Mousa ; M. Ali Basha Shaik ; Ralf Schlüter ; Hermann Ney

One of the challenges for Large Vocabulary Continuous Speech Recognition (LVCSR) of German is its complex morphology and high degree of compounding, which lead to high out-of-vocabulary (OOV) rates and poor Language Model (LM) probabilities. In such cases, building LMs at the morpheme level can be a better choice, achieving higher lexical coverage and lower LM perplexities. On the other hand, a successful approach to improving LM probability estimation is to incorporate features of words using feature-based LMs. In this paper, we use features derived for morphemes as well as words, thus combining the benefits of both morpheme-level and feature-rich modeling. We compare the performance of stream-based, class-based and factored LMs (FLMs). Relative reductions of around 1.5% in Word Error Rate (WER) are achieved compared to the best previous results obtained using FLMs.
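
To make the compounding problem concrete, here is a minimal sketch of morpheme-level decomposition by greedy longest-match against a small morpheme lexicon. It is an illustrative stand-in, not the segmentation used in the paper, and the lexicon is invented.

```python
def morpheme_split(word, lexicon):
    """Greedy longest-match decomposition of a word into known morphemes.

    Returns the list of morphemes, or None if the word cannot be fully
    decomposed. Real morphological analysis is considerably smarter.
    """
    parts, rest = [], word
    while rest:
        for k in range(len(rest), 0, -1):
            if rest[:k] in lexicon:
                parts.append(rest[:k])
                rest = rest[k:]
                break
        else:
            return None  # some prefix is not covered by the lexicon
    return parts

lexicon = {"donau", "dampf", "schiff", "fahrt"}
print(morpheme_split("donaudampfschifffahrt", lexicon))
# ['donau', 'dampf', 'schiff', 'fahrt'] -- one word-level LM token becomes four
```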

#8 Tied-state mixture language model for WFST-based speech recognition

Authors: Hitoshi Yamamoto ; Paul R. Dixon ; Shigeki Matsuda ; Chiori Hori ; Hideki Kashioka

This paper describes a language model combination method for automatic speech recognition (ASR) systems based on Weighted Finite-State Transducers (WFSTs). The performance of ASR in real applications often degrades when an input utterance is out of the domain of the prepared language models. To cover a wide range of domains, it is possible to utilize a combination of multiple language models. To do this, we propose a language model combination method with a two-step approach; it first uses a union operation to incorporate all components into a single transducer and then merges states of the transducer to mix n-grams included in multiple models and to retain unique n-grams in each model simultaneously. The method has been evaluated in speech recognition experiments on travel conversation tasks and has demonstrated improvements in recognition performance.

#9 Maximum entropy language model adaptation for mobile speech input

Authors: Tanel Alumäe ; Kaarel Kaljurand

This paper describes unsupervised adaptation of language models for many related target domains. In mobile speech input, the subject matter and vocabulary depend heavily on the usage context. We use automatically transcribed speech data to select a subset of the language model training data for building a maximum entropy model adapted to speech input. This model is further adapted to the most popular mobile applications. When interpolated with the background N-gram model, the adapted models give over 10% relative word error rate reduction in Estonian mobile speech input experiments.
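
The final interpolation step is standard. A minimal sketch is given below with plain unigram models standing in for the paper's maximum entropy and background N-gram models; all probabilities and the floor value are illustrative.

```python
import math
from typing import Dict, List

def interpolated_logprob(sentence: List[str],
                         adapted: Dict[str, float],
                         background: Dict[str, float],
                         lam: float = 0.6,
                         floor: float = 1e-8) -> float:
    """Log-probability under a linear interpolation of two unigram models.

    `lam` weights the adapted model; `floor` stands in for real smoothing.
    """
    logp = 0.0
    for w in sentence:
        p = lam * adapted.get(w, floor) + (1.0 - lam) * background.get(w, floor)
        logp += math.log(p)
    return logp

adapted = {"send": 0.05, "message": 0.04}      # in-domain estimates
background = {"send": 0.01, "message": 0.005}  # background estimates
print(interpolated_logprob(["send", "message"], adapted, background))
```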

#10 Supervised and unsupervised web-based language model domain adaptation

Authors: Gwénolé Lecorvé ; John Dines ; Thomas Hain ; Petr Motlicek

Domain language model adaptation consists of re-estimating the probabilities of a baseline LM to better match the specifics of a given broad topic of interest. To do so, a common strategy is to retrieve adaptation texts from the Web based on a given domain-representative seed text. In this paper, we study how the selection of this seed text influences the adaptation process and the performance of the resulting adapted language models in automatic speech recognition. More precisely, the goal of this study is to analyze how our Web-based adaptation approach differs between the supervised case, in which the seed text is manually generated, and the unsupervised case, in which the seed text is an automatic transcript. Experiments were carried out on data from a real-world use case, specifically videos produced for a university YouTube channel. Results show that our approach is quite robust, since unsupervised adaptation provides performance similar to the supervised case in terms of overall perplexity and word error rate.
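
One common realization of seed-based text retrieval is to rank candidate web documents by their perplexity under a small LM estimated from the seed text; whether the paper uses exactly this criterion is not stated here, so treat the sketch below as a generic illustration with invented texts.

```python
import math
from collections import Counter

def unigram_model(text: str, alpha: float = 1.0):
    """Add-alpha smoothed unigram model estimated from a seed text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words
    return lambda w: (counts.get(w, 0) + alpha) / (total + alpha * vocab)

def perplexity(model, text: str) -> float:
    words = text.lower().split()
    logp = sum(math.log(model(w)) for w in words)
    return math.exp(-logp / max(len(words), 1))

seed = "lecture on speech recognition and language models"
candidates = [
    "this talk introduces language models for speech recognition",
    "top ten recipes for a quick weeknight dinner",
]
model = unigram_model(seed)
# Keep the web texts whose perplexity under the seed model is lowest.
for doc in sorted(candidates, key=lambda d: perplexity(model, d)):
    print(round(perplexity(model, doc), 1), doc)
```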

#11 A hierarchical Bayesian approach for semi-supervised discriminative language modeling

Authors: Yik-Cheung Tam ; Paul Vozila

Discriminative language modeling provides a mechanism for differentiating between competing word hypotheses, which are usually ignored in traditional maximum likelihood estimation of N-gram language models. Discriminative language modeling usually requires manual transcription, which can be costly and slow to obtain. On the other hand, there are vast amounts of untranscribed speech data to which offline adaptation techniques can be applied to generate pseudo-truth transcriptions as an approximation to manual transcription. Viewing manual and pseudo-truth transcriptions as two domains, we perform hierarchical Bayesian domain adaptation on discriminative language models sharing a common prior model. Domain-specific and prior models are estimated jointly using training data. In N-best list rescoring experiments, hierarchical Bayesian domain adaptation yielded better recognition performance than the model trained only on manual transcription, and seems robust against an inferior prior.

#12 Leveraging social annotation for topic language model adaptation

Authors: Youzheng Wu ; Kazuhiko Abe ; Paul R. Dixon ; Chiori Hori ; Hideki Kashioka

Social annotation resources such as Yahoo! Answers already define broad-coverage hierarchical topic categories and include millions of documents annotated by web users. This paper argues that topic language model (LM) adaptation that effectively leverages such social annotations, while possibly noisy, may be more effective than unsupervised methods such as clustering-based and LDA-based algorithms. Experimental results on the IWSLT-2011 TED ASR data sets demonstrate that we can achieve modest improvements when compared with the unsupervised methods.

#13 LSTM neural networks for language modeling

Authors: Martin Sundermeyer ; Ralf Schlüter ; Hermann Ney

Neural networks have become increasingly popular for the task of language modeling. Whereas feed-forward networks only exploit a fixed context length to predict the next word of a sequence, standard recurrent neural networks can, conceptually, take into account all of the predecessor words. On the other hand, it is well known that recurrent networks are difficult to train and are therefore unlikely to show the full potential of recurrent models. These problems are addressed by the Long Short-Term Memory (LSTM) neural network architecture. In this work, we apply this type of network to an English and a large French language modeling task. Experiments show improvements of about 8% relative in perplexity over standard recurrent neural network LMs. In addition, we gain considerable improvements in WER on top of a state-of-the-art speech recognition system.
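
For concreteness, a minimal word-level LSTM LM is sketched below in PyTorch. Every hyperparameter is illustrative and unrelated to the paper's English and French setups.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Word-level LSTM LM: embedding -> LSTM -> softmax over the vocabulary."""

    def __init__(self, vocab_size: int, embed_dim: int = 128,
                 hidden_dim: int = 256, num_layers: int = 1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                            batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, time) word ids; returns next-word logits per position.
        hidden, state = self.lstm(self.embed(tokens), state)
        return self.out(hidden), state

# Toy forward pass: 2 sequences of 5 tokens over a 1000-word vocabulary.
model = LSTMLanguageModel(vocab_size=1000)
logits, _ = model(torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 1000])
```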

#14 Phrasal cohort based unsupervised discriminative language modeling

Authors: Puyang Xu ; Brian Roark ; Sanjeev Khudanpur

Simulated confusions enable the use of large text-only corpora for discriminative language modeling by hallucinating the likely recognition outputs that each (correct) sentence would be confused with. In [1], a novel approach was introduced to simulate confusions using phrasal cohorts derived directly from recognition output. However, the described approach relied on transcribed speech to derive cohorts. In this paper, we extend the phrasal cohort technique to the fully unsupervised scenario, where transcribed data are completely absent. Experimental results show that even if the cohorts are extracted from untranscribed speech, the unsupervised training can still achieve over 40% of the gains of the supervised approach. The results are presented on NIST data sets for a state-of-the-art LVCSR system.

#15 Deriving conversation-based features from unlabeled speech for discriminative language modeling

Authors: Damianos Karakos ; Brian Roark ; Izhak Shafran ; Kenji Sagae ; Maider Lehr ; Emily Prud'hommeaux ; Puyang Xu ; Nathan Glenn ; Sanjeev Khudanpur ; Murat Saraclar ; Dan Bikel ; Mark Dredze ; Chris Callison-Burch ; Yuan Cao ; Keith Hall ; Eva Hasler ; Philip Koehn ; Adam Lopez ; Matt Post ; Darcey Riley

The perceptron algorithm was used in [1] to estimate discriminative language models which correct errors in the output of ASR systems. In its simplest version, the algorithm increases the weight of n-gram features which appear in the correct (oracle) hypothesis and decreases the weight of n-gram features which appear in the 1-best hypothesis. In this paper, we show that the perceptron algorithm can be successfully used in a semi-supervised learning (SSL) framework, where limited amounts of labeled data are available. Our framework has some similarities to graph-based label propagation in the sense that a graph is built based on the proximity of unlabeled conversations and then used to propagate confidences (in the form of features) to the labeled data, based on which the perceptron trains a discriminative model. The novelty of our approach lies in the fact that the confidence "flows" from the unlabeled data to the labeled data, and not vice versa, as is done traditionally in SSL. Experiments conducted at the 2011 CLSP Summer Workshop on the conversational telephone speech corpora Dev04f and Eval04f demonstrate the effectiveness of the proposed approach.
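
The basic update from [1] that this work builds on is compact enough to sketch. The n-gram feature extraction and learning rate below are illustrative, not the workshop's implementation.

```python
from collections import Counter, defaultdict

def ngram_features(words, max_order=2):
    """Counts of all n-grams up to max_order, used as DLM features."""
    feats = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(words) - n + 1):
            feats[tuple(words[i:i + n])] += 1
    return feats

def perceptron_update(weights, oracle, one_best, lr=1.0):
    """Reward n-grams in the oracle hypothesis, penalize those in the 1-best."""
    for feat, cnt in ngram_features(oracle).items():
        weights[feat] += lr * cnt
    for feat, cnt in ngram_features(one_best).items():
        weights[feat] -= lr * cnt

weights = defaultdict(float)
perceptron_update(weights,
                  oracle="the cat sat".split(),
                  one_best="a cat sat".split())
print(dict(weights))  # shared n-grams cancel; differing ones get +/- weight
```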

#16 Performance comparison of training algorithms for semi-supervised discriminative language modeling

Authors: Erinç Dikici ; Arda Çelebi ; Murat Saraçlar

Discriminative language modeling (DLM) has been shown to improve the accuracy of automatic speech recognition (ASR) systems, but it requires large amounts of both acoustic and text data for training. One way to overcome this is to use simulated hypotheses instead of real hypotheses for training, which is called semi-supervised training. In this study, we compare six different perceptron algorithms under the semi-supervised training approach. We formulate the DLM both as a structured prediction and as a reranking problem, optimizing different criteria in each. We find that the ranking variants achieve similar or better word error rate (WER) reduction relative to structured perceptrons when used with real, simulated, or a combination of such data.

#17 On-the-fly topic adaptation for YouTube video transcription

Authors: Kapil Thadani ; Fadi Biadsy ; Dan Bikel

Automatic closed-captioning of video is a useful application of speech recognition technology but poses numerous challenges when applied to open-domain user-uploaded videos such as those on YouTube. In this work, we explore a strategy to improve decoding accuracy for video transcription by decoding each video with a language model (LM) adapted specifically to the topics that the video covers. Taxonomic topic classifiers are used to determine the topic content of videos and to build a large set of topic-specific LMs from web documents. We consider strategies for selecting and interpolating LMs in both supervised and unsupervised scenarios in a two-pass lattice rescoring framework. Experiments on a YouTube video corpus show a 10% relative reduction in WER over generic single-pass transcriptions as well as a statistically significant 2.5% reduction over rescoring with a very large non-adapted LM built from all the documents.

#18 Portability of semantic annotations for fast development of dialogue corpora

Authors: Bassam Jabaian ; Fabrice Lefèvre ; Laurent Besacier

The generalization of spoken dialogue systems increases the need for fast development of spoken language understanding modules for semantic tagging of speaker's turns. Statistical methods perform well for this task but require large corpora for training. Collecting such corpora is expensive in time and human expertise. In this paper we propose a semi-automatic annotation process for fast production of dialogue corpora. The approach consists in automatically pre-annotating the corpus and then manually correcting the annotation. To perform the pre-annotation, we propose to port an existing corpus and adapt it to the new data. The French MEDIA dialogue corpus is used as a starting point to produce two new corpora: one for a new language (Italian) and another for a new domain (theatre ticket reservation). We show that automatic pre-annotation leads to a significant gain in productivity compared to fully manual annotation and thus allows deriving new adaptation data which can be used to further improve the systems.

#19 Optimization of dialog strategies using automatic dialog simulation and statistical dialog management techniques

Authors: David Griol ; Zoraida Callejas ; Ramón López-Cózar

In this paper, we present a technique for learning optimal dialog management strategies. An automatic dialog generation technique, including a simulation of the communication channel, has been developed to acquire the required data, train dialog models, and explore new dialog strategies in order to learn the optimal one. A set of quantitative and qualitative measures has been defined to evaluate the quality of the strategies learned. We provide empirical evidence of the benefits of our proposal through its application to explore the space of possible dialog strategies for the UAH spoken dialog system.

#20 Preference-learning based inverse reinforcement learning for dialog control

Authors: Hiroaki Sugiyama ; Toyomi Meguro ; Yasuhiro Minami

Dialog systems that realize dialog control with reinforcement learning have recently been proposed. However, reinforcement learning has an open problem: it requires a reward function that is difficult to set appropriately. To set an appropriate reward function automatically, we propose preference-learning based inverse reinforcement learning (PIRL), which estimates a reward function from dialog sequences and their pairwise preferences, calculated from annotated ratings of the sequences. Inverse reinforcement learning finds a reward function with which a system generates sequences similar to the training ones. This means that current IRL assumes the training sequences are all equally appropriate for a given task and thus cannot utilize the ratings. In contrast, our PIRL can utilize pairwise preferences over the ratings to estimate the reward function. We examine the advantages of PIRL through comparisons with competitive algorithms that have been widely used for dialog control. Our experiments show that PIRL outperforms the other algorithms and has the potential to serve as an evaluation simulator for dialog control.
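
A minimal sketch of learning a reward from pairwise preferences is given below, using a Bradley-Terry style logistic objective over linear reward features. It illustrates the general idea only; the paper's PIRL algorithm and its features are not reproduced here.

```python
import numpy as np

def fit_preference_reward(feat_pairs, steps=200, lr=0.1):
    """Learn w so that r(seq) = w . phi(seq) respects pairwise preferences.

    feat_pairs: list of (phi_preferred, phi_dispreferred) feature vectors.
    Maximizes the logistic (Bradley-Terry) likelihood of the observed
    preferences by batch gradient ascent.
    """
    w = np.zeros(len(feat_pairs[0][0]))
    for _ in range(steps):
        grad = np.zeros_like(w)
        for phi_pos, phi_neg in feat_pairs:
            diff = np.asarray(phi_pos) - np.asarray(phi_neg)
            p = 1.0 / (1.0 + np.exp(-w @ diff))  # P(preferred wins)
            grad += (1.0 - p) * diff
        w += lr * grad / len(feat_pairs)
    return w

# Toy: dialogs described by 2 features; preferences favor the first feature.
pairs = [([1.0, 0.2], [0.1, 0.3]), ([0.9, 0.5], [0.2, 0.4])]
print(fit_preference_reward(pairs))
```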

#21 A data-driven approach to understanding spoken route directions in human-robot dialogue

Authors: Raveesh Meena ; Gabriel Skantze ; Joakim Gustafson

In this paper, we present a data-driven chunking parser for automatic interpretation of spoken route directions into a route graph that is useful for robot navigation. Different sets of features and machine learning algorithms are explored. The results indicate that our approach is robust to speech recognition errors.

#22 Detecting system-directed utterances using dialogue-level features

Authors: Kazunori Komatani ; Akira Hirano ; Mikio Nakano

We have developed a method to determine whether or not a user utterance is directed at the system. A spoken dialogue system should not respond to audio inputs that are not directed at it (e.g., a user's mutter), and it therefore needs to detect such inputs to avoid unsuitable responses. We classify the two cases by logistic regression based on a feature set including utterance timing, utterance length, and dialogue status. We conducted experiments using 5395 user utterances, for both transcriptions and automatic speech recognition results. Results showed that classification accuracy improved by 11.0 and 4.1 points, respectively. We also discuss which features are effective for the classification.
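
The classifier itself is a standard logistic regression over dialogue-level features; a sketch with invented features and labels (the paper's actual feature set is richer) follows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features per utterance: [seconds since the system prompt,
# utterance length in seconds, 1 if the system was awaiting input else 0].
X = np.array([[0.4, 1.2, 1],   # quick reply while the system waits
              [6.0, 0.5, 0],   # short mutter long after the prompt
              [0.8, 2.0, 1],
              [9.5, 0.7, 0]])
y = np.array([1, 0, 1, 0])     # 1 = directed at the system

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.5, 1.5, 1]]))  # likely system-directed -> [1]
```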

#23 An online generated transducer to increase dialog manager coverage

Authors: Joaquin Planells ; Lluís-F. Hurtado ; Emilio Sanchis ; Encarna Segarra

This paper presents a new approach for dynamically increasing the coverage of a Statistical Dialog Manager. A Stochastic Finite-State Transducer for dialog management is estimated using a dialog simulator. This corpus-based model can cover the most typical user behavior; however, unexpected situations sometimes arise, and when they do, the Dialog Manager model has no information to determine the next action. To deal with this problem, an Online Dialog Simulator is used to obtain synthetic dialogs for re-estimating the model, allowing the dialog to continue. This approach has been evaluated with real users in a sports facility booking task.

#24 A sequential Bayesian dialog agent for computational ethnography

Authors: Abe Kazemzadeh ; James Gibson ; Juanchen Li ; Sungbok Lee ; Panayiotis G. Georgiou ; Shrikanth Narayanan

We present a sequential Bayesian belief update algorithm for an emotional dialog agent's inference and behavior. The agent's purpose is to collect usage patterns of natural language descriptions of emotions among a community of speakers, a task which can be seen as a type of computational ethnography. We describe our target application, an emotionally intelligent agent that can ask questions and learn about emotions by playing the emotion twenty questions (EMO20Q) game. We formalize the agent's algorithms mathematically and test the model in an experiment of 45 human-computer dialogs with a range of emotion words as the independent variable. We found that 44% of these dialog games were completed successfully, compared with earlier work in which human-human dialogs resulted in 85% successful completion on average. Despite this lower-than-human performance, especially on difficult emotion words, subjects rated the agent's humanity at 6.1 on a 0-to-10 scale. This indicates that the algorithm produces realistic behavior, but that issues of data sparsity may remain.
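
The belief update at the heart of such an agent reduces to Bayes' rule applied once per question-answer exchange. The sketch below uses an invented three-word hypothesis space and made-up answer likelihoods; it is a schematic of the update, not the paper's model.

```python
def bayes_update(prior, likelihoods):
    """One sequential Bayesian update over emotion-word hypotheses.

    prior: dict word -> P(word); likelihoods: dict word -> P(answer | word).
    """
    posterior = {w: prior[w] * likelihoods[w] for w in prior}
    z = sum(posterior.values())
    return {w: p / z for w, p in posterior.items()}

# Toy EMO20Q-style step: ask "Is it a positive emotion?", observe "yes".
belief = {"joy": 1 / 3, "anger": 1 / 3, "relief": 1 / 3}
p_yes = {"joy": 0.9, "anger": 0.1, "relief": 0.8}  # invented likelihoods
belief = bayes_update(belief, p_yes)
print(belief)  # probability mass shifts toward "joy" and "relief"
```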

#25 ClippyScript: a programming language for multi-domain dialogue systems

Authors: Frank Seide ; Sean McDirmid

The past year has witnessed a revival of spoken dialogue systems, which are becoming multi-domain and ubiquitous. In this context, the problem of efficiently scripting dialogues is becoming increasingly important. As of today, statistical approaches to dialogue control are not yet feasible; the problem remains quite firmly one of manual coding. This paper describes a programming language we christened ClippyScript, which is aimed at rapid manual scripting of multi-domain dialogue systems. Only four main keywords - MODULE, SLOTS, GRAMMAR, and ACTIONS - plus a concept of focus provide the necessary abstractions for language understanding, dynamic grammars, hierarchical slot filling, multiple domains, access to data services, and pursuit of the dialogue goal. The language's expressive power is boosted by the ability to embed code snippets in a high-level programming language (C#).